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Abstract 

A prototypical blind signal separation problem is the so-called cocktail party problem, with 
n people talking simultaneously and n different microphones within a room. The goal is to 
recover each speech signal from the microphone inputs. Mathematically this can be modeled by 
assuming that we are given samples from a n-dimensional random variable X = AS, where S 
is a vector whose coordinates are independent random variables corresponding to each speaker. 
The objective is to recover the matrix A -1 given random samples from X. A range of techniques 
collectively known as Independent Component Analysis (ICA) have been proposed to address 
this problem in the signal processing and machine learning literature. Many of these techniques 
are based on using the kurtosis or other cumulants to recover the components. 

In this paper we propose a new algorithm for solving the blind signal separation problem in 
the presence of additive Gaussian noise, when we are given samples from X = ^4S + rj, where 
r] is drawn from an unknown n-dimensional Gaussian distribution. Our approach is based 
on a method for decorrelating a sample with additive Gaussian noise under the assumption 
that the underlying distribution is a linear transformation of a distribution with independent 
components. Our decorrelation routine is based on the properties of cumulant tensors and can 
be combined with any standard cumulant-based method for ICA to get an algorithm that is 
provably robust in the presence of Gaussian noise. We derive polynomial bounds for sample 
complexity and error propagation of our method. 

Our results generalize the recent work of Arora et al. [1] which deals with a special case of 
ICA when S is the uniform probability distribution over the binary cube. 



1 Introduction and related work 

A prototypical blind signal separation setting is the so-called cocktail party problem: in a room, 
there are n people speaking simultaneously and n microphones, with each microphone capturing a 
superposition of the voices. The objective is to recover the voice of each individual speaker. The 
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simplest modeling assumption is to consider each speaker as producing a signal to be a random 
variable independent of the others and to take the superposition to be a linear transformation 
independent of time. This leads to the following problem: given a sample from n-dimensional random 
variable X, satisfying X = ^4S, where A is a non-singular square matrix and S is another random 
vector whose coordinates are unknown independently distributed (but not necessarily identical) 
random variables, we need to recover the matrix A^ 1 . Equivalently, we need to recover the basis 
corresponding to the directions of the independent components. 

The name Independent Component Analysis refers to a broad range of algorithms addressing 
this signal separation problem as well as its variants and extensions. It has generated significant 
interest and an extensive literature in the signal processing and machine learning communities due 
to its applicability to a variety of important practical situations including speech [T3], vision [5] 
and various biological and medical applications, e.g., For a comprehensive introduction see the 

books num. 

One widely used class of algorithms for ICA is based on the remarkable fact that if the data is 
whitened, that is, X has the zero mean and the identity covariance matrix, then the absolute value 
of kurtosis reaches its maximum in the directions corresponding to the independent components. 
More precisely, consider the kurtosis as a function on the 71-dimensional unit sphere. For whitened 
data it can be defined as follows: 

v i-> k 4 (v • X) := E((v • X) 4 ) - 3 

It can be shown 0^ that the vectors corresponding to the maxima of the absolute value of k,j(v-X) 
form an orthonormal basis whose elements are independent random variables. Thus the underlying 
structure of the signal can be recovered by analyzing the behavior of this function. Moreover, 
computing the kurtosis involves the expected value of the fourth power of a random variable, which 
can be easily approximated from a finite sample. 

This observation leads to the following procedure for the Independent Component Analysis in 
the noiseless case: 

Step 1. "Whiten" the original signal, that is apply a linear transformation which transforms the 
covariance matrix of the sample to the identity. This is typically achieved by using the Principal 
Component Analysis (PCA) to transform the input data to the basis of its principal directions by 
an orthogonal transformation and rescaling the resulting data appropriately. 

Step 2. After the signal is whitened various optimization procedures can be used to find the maxima 
of the absolute value of kurtosis over the unit sphere. The independent components are recovered 
from the directions of these maxima. 

In their recent paper [T] Arora, et al. make an important observation that for a slight variation 
of Step 2 to work, it is sufficient for the the sample to be decorrelated (quasi-whitened), that is, 
to have independent coordinates in some orthogonal basis, rather then fully whitened (having the 
identity covariance matrix). 

In this paper we consider the problem of signal separation for a noisy signal X = AS + 77, where 
77 is an unknown n-dimensional Gaussian distribution. The main difficulty is in Step 1, since the 
principal directions given by PCA are contaminated by the noise and do not generally decorrelate 
the underlying signal. Interestingly, as a result of the invariance of the kurtosis under the additive 
Gaussian noise, Step 2 of the algorithm is still valid and the usual methods and analyses still apply 
with minor caveats. 

The main contribution of our paper is addressing the problem of decorrelating the underlying 
signal in the presence of noise. We show how to approximate a matrix B, such that B~ 1 A is diagonal 
in the basis of independent coordinates. We provide polynomial bounds for the sample complexity 
and error analysis as well as an analysis of error propagation compatible with any analysis of Step 
2. 

Our approach can be viewed as a noise-invariant version of PCA for the special case when the 
underlying probability distribution is a product of independent variables. The method is based on 
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the properties of the fourth cumulant tensor, rather than the usual covariance matrix used in PCA. 
To the best of our knowledge, this is the first general algorithm for noisy IC A with sample complexity 
and running time guarantees. Moreover, unlike methods such as |20j . our approach is compatible 
with any optimization procedure for the Step 2. 

Related work. Over the last twenty years blind signal separatiorQ has become a large and active 
area of research in signal processing and machine learning community. An important class of methods 
for ICA is based on the properties of kurtosis and other higher-order cumulants. 

Most of these works concentrate on algorithms, implementations and applications and do not pro- 
vide a sample or running-time complexity analysis for the algorithms. One such analysis is provided 
in Frieze, et al. [BJ, where the authors address the question of learning a linear transformation, which 
is equivalent to the ICA problem, and provide a complexity analysis. In a slightly different context of 
cryptoanalysis, |16j analyzes a kurtosis-based method method for learning a parallelopiped. In |19j 
the authors analyze a generalized version of ICA for learning higher-dimensional subspace "juntas" 
in the presence of noise. 

The problem of blind signal separation in the presence of noise has been an active topic of 
research in the machine learning literature. In particular we would like to point out the work of 
Yeredor [2UJ which proposes an elegant one-step approach for general ICA with Gaussian noise, based 
on approximating the Hessian of the second characteristic function, namely v H> V 2 , \ogK x (e v x ), at 
a finite number of generic choices of v. The recent work of Hsu and Kakade [7J Section 3, Theorem 
3] proposes an approach similar to Yeredor's, using the Hessian of the directional kurtosis instead of 
the second characteristic function and makes interesting connections to learning Gaussian mixture 
distributions in high dimension. Finally, the sample complexity analysis for noisy ICA is studied 
in Arora et al. pQ, which provides a complete discussion for the special case when the underlying 
signal is a uniform distribution over the n-dimensional binary cube { — 1, 1}™. 

We also would like to point out that our approach is closely related to the class of tensor methods 
for data analysis, see e.g. [17j [15] 

2 Properties of Cumulants 

Let </>x(t) = E[cxp(it T X)], t € M™ denote the first charateristic function of a n-dimensional vector 
valued random variable X, and let V'x(t) = log(0x(t)) denote the second characteristic function of 
X. Cumulants are defined as the coefficients of the Taylor Expansion of the second characteristic 
function. Specifically, using the multi-index notation, we write 



For each cumulant Cum(Xi 1 ,--- , Xj r ), r is referred to as the order of the cumulant. Order r 
cumulants of a random variable X can be collected into a cumulant tensor, called the r th cumulant 
tensor of X. For instance, the fourth order cumulant tensor of X, denoted by Qx in this paper, 
is defined by (Qx)yJM — Cum(Xj, Xj, X/., X;). Since any simultaneous draw of random variables 
can be viewed as a draw of a single vector-valued random variable, this definition can be used to 
construct cross cumulants between arbitrary random variables. In the univariate case in which X 
and t are scalars, the notation K r (X) is used to denote the r th order cumulant Cum(A, • • • , X). 

Cumulants are similar in flavor to moments, and indeed all cumulants have polynomial expansions 
in terms of the moments of the same and smalle order. For example, the fourth cumulant (kurtosis) of 
a 0-mean one-dimensional random variable X can be expanded n 4 (X) — E[Y 4 ] — 3E[A 2 ] 2 . However, 
cumulants have nice algebraic properties not shared by moments, properties on which this work 

1 Also known as Blind Source Separation. 
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relies heavily. Let Xi,- ■ ■ ,X r be real- valued random variables. Then, cross cumulants are known 
to manifest the following properties: 



1. (Multilinearity) If a € K is a constant, then 



Cum(Xi, • • • , CiXi, • • • , X r ) = CiCum(Xi, ■ ■ ■ 




Also, if Yi is a random variable, then 



Cum(AV- - ,Xi + Y ir - 
= Cum(Xi, • • • , Xi, ■ ■ 



■ ,X r ) + Cum(X 1 ,--- ,Y h --- 



X r ). 



2. (Independence) If 2 variables Xi and Xj (i < j) are independent random variables, then the 
cross cumulant Cum(Xi, • • • , Xi, • • • , Xj, • • • , X n ) is zero. Combined with the multilinearity 
property, this implies that if the variables Y\, ■ ■ ■ ,Y n are independent of X\, ■ ■ ■ , X n , then 



3. (Vanishing Gaussians) The only non-zero cumulant tensors of Gaussian random variables are 
the 1-tcnsor mean and the 2-tensor covariance matrix. 

Note that in the univariate case, these properties become: 

1. (Additivity) If X and Y are independent random variables, then n r (X + Y) = K r (X) + K r (Y). 

2. (Homogeneity of degree r) If c is a constant, then n r (cX) — c r n r (X). 

3. (Vanishing Gaussians) The only non-zero cumulants of a Gaussian random variable are the 
mean and the variance (the first and second order cumulants). 

3 Problem Statement and Main Result 

Let x« x< 2 ) ••• , xW S R™ be an i.i.d. TV-sample of vector- valued random variables. In independent 
component analysis (ICA) it is assumed that each x« is generated from a latent random variable 
via an unknown mixing matrix A such that 



where r) is additive noise. The latent random variable S is typically assumed to be a vector in K"; 
though in principle, it could be a vector in any space IR m where m < n. The individual components of 
S are assumed to be independent random variables. A is taken to be a full rank matrix, A E M. nxrn . 
It will be assumed for simplicity that m = n, thus making A invertible. We will further assume 
that each random variable Si has variance 1. Note that this last assumption serves to remove an 
ambiguity of the problem, since the columns of A could otherwise be chosen to have any scale. As 
a result of these assumptions, the covariance Cov(S) becomes the identity matrix /. 

As discussed in the introduction, most ICA algorithms can be broken down into 2 steps. In the 
first step, the independent components are made orthogonal and rescaled such that X = RS where 
R is an orthogonal matrix. This method of decorrclating the independent components is termed 
whitening. In the second step, the columns of R (which correspond to independent components) up 
to sign and order are found. 

In the noisy case the main challenge is presented by Step 1, as Step 2 for kurtosis-based methods 
is naturally invariant to Gaussian noise. Since additive Gaussian noise affects the covariance matrix 



Cum(X 1 + Y u X 2 + Y 2 , ■ ■ ■ ,X n + Y n ) 

= Cum(A 1 ,A 2 ,-.. ,X n ) + Cum(Y 1 ,Y 2 ,--- ,Y n ). 



x 



(0 = As {i) + ri 
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Cov(AS + tj), PCA based whitening fails to orthogonalize the independent components. It was 
observed in [I] that a variation on step 1 could be used. It is enough to make the independent 
components orthogonal without giving them the same scale. Whereas true whitening sets X = RS, 
we replace R with RD such that R is orthogonal and D is a diagonal scaling matrix. Thus, following 
PQ, quasi-whitening can be defined as follows: 

Definition 3.1. A quasi-whitening matrix is a matrix W such that W A = RD for some orthogonal 
matrix R and nonsingular diagonal matrix D. 

We shall now state our main result. Let ei, ■ • • , e„ be the canonical vectors that form a basis 
for the space spanned by the random vector S. Let K m j n = minj(|/«4(Si)|), K m ax = maXjfl/e^Sj)!), 
and /ifc = ruaXj(E[Sj ]). Let Aj denote the i th column of matrix A. For clarity of the presentation, 
we use the following machine model for the running time: a random access machine that allows the 
following exact arithmetic operations over real numbers in constant time: addition, substraction, 
multiplication, division and square root. 

Theorem 3.2. Let e > and S € (0, 1). Given 

samples of X = AS + t] we can compute, in time polynomial in N and n, an approximate quasi- 
whitening matrix B so that with probability at least 1 — S over the sample we have 

1. For i ^ j , 

(B~ 1 Ae l , B~ 1 Ae i ) 

-e<^ 3 -!—<t 1 

" WB-iAeMB^Aejh ~ 

2. The length of e^ is scaled under the transformation B~ 1 A as: 

(1 - e)\\Ml < US-Ue^ < (l + e)\\AA\l (2) 

In simpler words, quasi-whitening approximately orthogonalizes the independent components of 
X and scales the independent components based on the lengths of the columns of A. 

We note that existing cumulant-based methods already employed for step 2 in ICA can be 
modified in reasonably straightforward ways to work under quasi-whitening. Several popular ICA 
algorithms including JADE [3] and the kurtosis based implementation of FastICA [8j [10] are imple- 
mented using cumulants. Since higher order cumulants ignore Gaussian noise, this allows for the 
creation of a class of new algorithms which are resistant to additive Gaussian noise. In the special 
case where each S is drawn uniformly from {—1, 1}™, this has been done by Arora et al in pQ. 

To see the validity of fourth cumulant based algorithms for the second step of ICA in the presence 
of Gaussian noise, we draw from Observation 2 of Frieze et al in [6]. An interpretation of the 
statement and proof is that given ai, «2, . . . , a„ € R such that each ^ and a function of the 
form G(v) = Y^i=i v t a i sucri that v is drawn from the unit sphere, when there exists some > 0, 
a complete list of local maxima of G(v) is given by {ie, : oti > 0} (where e, is the ith. canonical 
vector). Similarly, when there exists some ai < 0, a complete list of local minima of G(v) is given 
by {±e^ : a>i < 0} Using the properties of cumulants, it follows that given v G R™ drawn from the 
unit sphere, 

n 

« 4 (vS) = ^v^ 4 (S i ), (3) 

i=l 

2 Hyvarinen had a different definition of quasi- whitening in [9]. 
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where Ki(Si) takes on the role of on. As such, any algorithm which maximizes |ft4(v ■ S)| or alterna- 
tively k 4 (v • S) 2 will find the canonical vectors. Of course, one cannot work in the coordinate system 
of S, but under the assumption of orthogonality provided, one can instead maximize |k 4 (u-X)| 
where u is drawn form the unit sphere since 

k 4 (u • X) = k 4 (u • (ADS)) = K 4 ((i? T u) • (DS)) 

using that additive Gaussian noise is ignored by cumulants. DS is simply a rescaling of S, and 
K 4 (Si) can be replaced by K^duSi) in equation Using the change of variable v = R T u, any 
locally maximal value for u will correspond to a column of i?, thus recovering a component Si up to 
scaling and noise. In [5J, Observation 2 summarizes a very similar result in the case of true whitening 
without additive Gaussian noise using the fourth moment instead of fourth cumulant, and a mostly 
correct efficient algorithm and analysis is provided for the fourth moment based on this observation. 

4 How to Achieve Quasi- Whitening 

Recall that Qx denotes the fourth cumulant tensor of the observed variable X, with ijkl th entry: 

{Qx)ijki = Cum(Xj, Xj, Xfc, X;), 

and define an operation of tensors on matrices T x K nx ™ —> R™ xn by: 

n 

(Qx°M)ij = Cum(X,-,Xj,X/ s ,X;)m/fe. 
fc,i=i 

Before proceeding with the argument leading to the construction of a quasi-whitening matrix, 
it is worth making several observations about this operation. First, the operation can be viewed 
as matrix- vector multiplication. Use multi-indices a,/3 such that a runs over and /3 runs over 
(l,k), and note that by symmetry, (Qx)ijkl — {Qx)ijlk = {Qx)a0- Under this flattening of the 
tensor Qx, the operation becomes matrix- vector multiplication with M taking on the role of the 
vector using mik = m a . 

The following Lemma describes how the cumulant tensor transforms under a linear change of 
variable: 

Lemma 4.1. Given a random vector-valued variable Y £ K™ and matrices B,M S R™ xn , then 
Q SY o M = S(Q Y o {B t MB))B t . 

Proof. The proof follows primarily from the multilinearity of cumulants: 

n 

{Qby ° M)ij = Cxim((BY) t ,(BY) J ,(BY) k ,(BY) l )m lk 

fe,i=i 

n n 

= ^ X! Cum(b iq Y q ,bj r Y r ,bk s Y s ,bi t Y t )mik 

k,l=l q,r,s,t=l 
n n 

= ^ ^ b lq b jr Cum(Y q ,Y r ,Y s ,Y t )bi t mi k b ks 

k,l=l q,r,s,t=l 
n 

= b lq b jr Cum(Y q ,Y r ,Y s .Y t )(B T MB) ts 

q.r 7 s.t—l 

n 

= J2 biqbjr(Qv ° {B T MB)) qr , 
q,r—l 

which can be equivalently written as Qby ° M = B(Q^ o (B T MB))B T . □ 
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The above ideas will be useful both in constructing a quasi-whitening matrix in the noiseless 
case, as well as in finding an estimate to a quasi-whitening matrix from data. What follows is the 
construction of a quasi-whitening matrix when one knows the cumulant tensor exactly. 

Lemma 4.2. Let M be an arbitrary matrix. Then, Qx ° M — ADA T where D is a diagonal matrix 
with q th entry ol qq = K i (S q )A^ MA q . 

Proof. This proof will proceed by simplifying Qx ° M using the properties of cumulants. 

n 

(Qx.oM)ij= ^ Cum(Xi,X i ,Xfe,X;)mife 

k,l=l 

n / n n 

= ^ Cum I ^AigSq + ^,^2 A jq S q +7]^ 
k,l=l \q=l q=l 

n n \ 

^ A kq S q + Vk,Yl A iq s q + Vi mik 

q=l 9=1 / 

n n 

= ^2 Cnm ( A iqS q , A jq S q , A kq S q , A lq S q )m lk 

k,l=l q=l 
n n 

= Y ^ A iq A jq A kq A lq Cum(S q , S q , S q , S q )mi k , 

k,l=l 9=1 

where the last two equalities come from the independence, multilinearity, and vanishing Gaussian 
properties. Switching into univariate cumulant notation and rearranging summations yields: 

n n 

(Qx ° M)ij = ^ A iq A jq K 4 (S q ) 2J Ai q mi k A kq 

q=\ k,l 
n 

= A iq A jq K 4 (S q )A^MA q 

q=l 

which has matrix form: 

Qx o M = ADA T 

where D is a diagonal matrix with diagonal entries d qq = K4(S q )A^MA q . □ 

Theorem 4.3. Let M be the matrix (Qx ° I) • Let B be a factorization matrix such that BB T — 
Qx°M. Then, B^ 1 is a Quasi- Whitening matrix. 

Proof. Applying Lemma 14.21 gives Qx ° I = AD'A T with d' qq = K 4 (S q )A q ■ A q . Note that M = 
(A T y 1 D'- 1 A~ 1 . Applying Lemma I4T21 a second time yields Q x o M = ADA T where d qq = 
K4(S q )A^MA q gives the diagonal elements of D. Manipulating d qq yields: 

d qq = Ki (S q )A T q {A T Y 1 D'- 1 A- 1 A q 

= K4 (S g )e^p')^e q 
= K 4 (S q )[n 4 (S q )A q -A q r 1 
1 



\A q \\l 



Note that d qq is a positive number for each diagonal entry of D. D 1 / 2 exists and can be uniquely 
defined by taking the positive square root of all diagonal entries. Letting B be any factorization 
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matrix such that BB T = Q x ° ((Qx ° i") -1 ) = ADA T , then / = B' 1 AD 1 / 2 {B~ 1 AD 1 ' 2 ) 7, gives 
that B^AD 1 ' 2 = R for some orthogonal matrix R. Hence, B 1 A = RD 1 ' 2 gives that B 1 is a 
quasi-whitening matrix. □ 



5 Estimation of Cumulants 

So far we have shown that given exact knowledge of the fourth order cumulant tensor for the random 
variable X = ylS + rj, it is possible to find a quasi-whitening matrix B^ 1 such that B^ 1 A = RD 
for some orthogonal and diagonal matrices R and D respectively. In practice, one does not have 
exact knowledge of the cumulant tensor, and the cumulant tensor thus needs to be estimated from 
samples. Cumulants can be estimated using fc-statistics, which are unbiased estimates of cumulants. 
fc-statistics have been studied within the statistics community, and are discussed in the chapter 4 of 
|14j . For the fourth order cumulant tensor, given random variables Yi, Y2, • ■ ■ , Y„, the fc-statistic 
fc(Yj, Yj, Yfc, Yi), which estimates Cum(Yi, Yj, Yk, Yj), is: 



where is a function invariant under permutations of its indices defined by (j>(i, i, i, i) — 1, (f>(i, i, i, j) = 



<Ki,i,j,j) = -l/(N-l), 4>{i, k) = 2/[(iV-l)(iV-2)],and0(i, = -6/[(N-l)(N-2)(N-3)} 



when i,j,k,l € [N] arc distinct [T3]. 

fc-statistics share several important properties with the cumulant tensors that they estimate. The 
fc-statistic is symmetric in that k(Xi, Xj , X k , X{) is invariant under reordering of indices, and it is 
also multilinear. Multilinearity is shown for the fourth fc-statistic in the following Lemma. 

Lemma 5.1. The k -statistic transforms multilinearly. 

Proof. There are 2 properties of multilinearity. For simplicity of notation, they will be only shown 
on the first coordinate of the k-statistic function. Let Y^, Yj, Yfc, Yi, Zj be random variables, and 
let cel. Then 

1. The additivity portion of multilinearity comes from: 
fc(Y l + Z,,Y J ,Y fc ,Y ; ) 



fc(Y l ,Y„Y fc ,Y) = - J2 ^.M.tOyWM'M 



r,s,t,u— 1 



1 



N 



N 



N N 



1 X! 4- \ (r) («) (*) («) , if ± \ 0) (s) (t) ( 




fc(Y l ,Y„Y fc ,Y) + fc(Z l ,Y J ,Y fc ,Y ; ). 



2. The multiplicative portion of multilinearity comes from: 



k( C Y h y^ Yfc, y,) = - ^ s > *» ^y^y^ yj? M 



r,s,t,u— 1 




r,s,t,u— 1 



= fc(Y i ,Y i ,Yfc,Y i ). 
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□ 



These multilinearity properties imply that 

fc(Xj,X,',Xfc, X;) 

= HA tq (S + A^ 1 r]) q , A jr {S + A~ 1 rj) r ,A ks (S + A~ 1 rj) s , A it (S + A" 1 ^) 



qrst 



J2A iq A jr A ks Ai t k(S q + {A 1 r]) q ,S r + (A 1 r 1 ) r ,S s + {A 1 r ] ) s ,S t + (A 1 r 1 ) t ). 

qrst 



As such, Lemma 14.11 applies also to fc-statistic estimates of random variables. In particular, it is 
possible to think of the fc-statistic tensor associated with the random variable X as being generated 
from an unobserved fc-statistic tensor from the latent samples of S + A~ lr q. We can work directly 
with the random variable S + A^ 1 ^ for the purposes of error analysis. This will be a natural 
approach since the difficulty of the problem relies partially on the kurtosis of the latent distribution 
for S + A- 1 T 1 . 

Let Hk represent maxiE[Sf]. By assumption, /ii = and [12 = 1. Let 77* denote A _1 t?. Let 



<r„* = max (a/u t S t) »u) 
|| =x V ' 

where T, n * is the covariance matrix of rj* . The error induced by estimating the latent fourth cumulant 
tensor fcs+17* from a sample can be bounded using the following 2 Lemmas: 

Lemma 5.2. Let Z = S + 77*. Then, 

max,-(:Li WiZf 1 
Var(fc(Z l , Z„ Z fe , Z,)) = 0( * e ff 1 

Proof. In order to save space, it will be useful to use multi-index notation. In particular, taking 
I = {ii, ii-, *3) h) € [ n ] 4 an d ct = (ai, Q!2, «3, 04) € [N] 4 , <fi a z i will be denote 

(«l) _("2) (a 3 ) (04) 



Further, the set aD f3 will be defined as: 

a n (3 = {at : on = f3j for some pair 
Keeping these notations in mind, we can proceed with the proof. Let / € [n] 4 . 
Var(fc(Z Jl ,Z l2 ,Z l3 ,Z J4 )) 



= E 



\ "£[W] 4 / 



— E 



]v E 



(«) 



iV 2 



t>e[jv] 4 /3e[jv] 4 



aEi[iV] 1 

1 

iV 2 " 



03) 1 



aeW] 4 /3e[w] 4 



a6[AH 4 /3£[»] 4 



^ E ^ E ^[Wi- 



a£[JV] 4 t*e[iv] 4 



(4) 
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Equation Q contains the essence of the argument. However, in order to complete the argument, 
several facts need to be demonstrated. First, it needs to be seen that \E[z i j' ) z i f ) }\ < maxj(E[Zf]). 
To see this, use the Cauchy-Schwartz inequality on random variables Y\ 1 Y2 to get: 

E[FiF 2 ] < max(E[r l 2 ],E[F 2 2 ]) 



(5) 



Applying this fact recursively yields that 



< maxi(E[Zf]). 



The second difficulty that arises is seeing how limiting oneself to samples in which a n f3 ^ 
restricts the summation. First, let dist(/3) denote the number of distinct indices in ft. If c = dist(/3), 
then there are ( N ) choices of index values that can be used to generate j3, of which ( N ~ 4 ) certainly 
do not intersect a. As such, 



gives an upper bound on the fraction of index sets in which fj n a 7^ when dist(/3) = c. Finally, 
noting that \<j>p\ < 7/(N dist( -^- 1 ) for sufficiently large N and that Y, a e[N]* = °( N )> we have 
sufficient tools with which to proceed from (Q} : 



Var(fc(Z il ,Z l2 ,Z J3 ,Z l4 )) < — 



E ^e E m&'F: 

ae\N] 4 c=l dist(/3)=c 



ae[jV] 4 



C=l dist(/3)=c 



4 / JV 



^^ E[z * ] E 1^1 E 



ae[N] 4 







7N- C+1 E 1 

dist(/3)=c 



0[-^max(E[Z8])A- 1 A- c + 1 A c E 1^1) 



o 



max ieW (E[Z8]) 



N 



□ 



Lemma 5.3. Given e, 5 > 0, the error of each term in the k-statistic tensor for S + A 1 tj can be 
bounded beneath e with confidence 1 — 8 using 



N = 



c 2 5 



A*8 



O-min(^) 8 



samples. 

Proof. Define Z = S + rj*. Then using LemmaS Var(fc(Z j; , Z jt Z fe , Z ; )) = 0(i max, e[n] E[Z 8 ]). 
Using the binomial expansion, 



m— 

4 

E 

m=0 



2m 



2m /„* \ 8 — 2mi 



[sr«) 
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since odd 0-mean Gaussian moments are 0. Using equation ([S]), we see that the dominant terms are 
fis and (<T n / 'cr m in(^4)) 8 ■ In particular, the cross terms come from E[S^™ 1 ^_ 2m ]. When m = 2, from 
([5]), it follows that 

When m = 1, then 



E^!) 4 ] <max(/i 8 ,E[(r,*) 8 ] 



E[S^(r,p e ]=E[(S2(, 7 ;) 2 )(^) 4 i 



< 



max(E[S*«) 4 )],E[(r,*) 8 )], 



for which E[S 4 (»7*) 4 )] < max(/j 8 , E[(r;*) 8 ]) has just been shown. The case m = 3 can be argued 
similarly to m = 1 interchanging the roles of S and r)*. Thus, one gets: 

E[Z«]=0( M8 +E[(r,;) 8 ]). 

For even Gaussian moments, the following equation holds (see for instance [T2] section 3.4): 

Tffl„2k] (2fc)! 2fc 

- 

It follows that 

E[Z s q }=0( t i s +a s ri ,) 

= 0{n8+<T*/<j mia (Af). 

Chebyshev's inequality states that for a random variable Y, Pr(|F — (j,y\ > C(J y) < -7- Taking F 
to be fc(Si + 77*, Sj + 77*, S fc + 77*, S; + 77*), then since the ^-statistic is unbiased, it follows that its 
expectation is Cum(S.; + 77*, Sj + 77*, Sfe + 77*, S; + 77*) — Cum(Sj, Sj, Si), c can be chosen such 
that S/n 4 > 1/c 2 . Then, in order to bound the error beneath e, it suffices to satisfy: 



e > cJVax(k(Si + 77?, Sj + 77*, S fc + tj!, S, + »tf )), 



which can be guaranteed by choosing TV such that e > cO((if max gg r n i 

(E[Zf])) 1 /2). This leads to 

the expression: 



c0 Uh$ E[zl]) p e 



N > O 



4 M8 + K/^mi„(^) 8 ) 



e 2 <5 



Applying the union bound, this number of samples is sufficient to guarantee with confidence 1 — 8 
that all terms in the fc-statistic tensor for S + A~ lr q can be bounded beneath e with confidence 
1-5. □ 



6 Error Propagation for Quasi- Whitening 

What follows is an analysis for how error propagates throughout the quasi-whitening algorithm. It 
will be demonstrated that the canonical vectors which act as a basis for the independent components 
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of S will remain approximately orthogonal after quasi-whitening given sufficiently many samples. It 
will be demonstrated that the required number of samples is polynomial in terms of -, j, n, k(A), 
^ max , ^-1— , - pWr , and where e is the allowable cosine error from orthogonality of the basis 
vectors, and 1 — S is the confidence of success. Since in the previous section, it was demonstrated 
that given any e > 0, the sample estimate of the cumulant tensor can have error bounded by e in 
each term, it suffices to demonstrate that at each step of the algorithm, error does not grow too fast. 
The confidence is unchanged since only one sample is taken. As a notation, hatted variables shall 
be used to denote approximations of non-hatted variables. It is assumed that the fc-statistic tensor 
Qs estimate of Qs is defined from samples of the noisy latent variable S + rj* = .A _1 X, though for 
simplicity, rj* is suppressed from the subscript notation. (See also the discussion after Lemma 15. II ) 

Lemma 6.1. Given a sample of X, let Qx and Qs be the associated k-statistic estimates for Qx 
and Qs respectively, and let M be an estimate for the matrix M such that for some €1,62 > 0, 
HQs — Qs||max < £1 and \\M — M\\ 2 < e 2 . There exists a matrix Y such that Qx M — AY A T 
and Qx M — ADA T where D is the diagonal matrix defined in Lemma \4-S\ and the error in the 
estimate Y is bounded as: 

\\Y-D\\ 2 <\\Y-D\\ F 

< n 2 \\A\\l\\M\\ F ei + >/nea\\A\\l( 

Proof. Using Lcmma l4~Tl one gets Qx°M = A(Qso(A T MA))A T , which gives that Y is well defined, 
and Y = {Qs ° (A T MA)). By similar reasoning, D — Qs o [A 7 MA). In the following investigation 
of error propagation, the tensors Qs and Qs will be treated as matrices as described in section HJ 
and the 2-norm used on the tensors should be interpreted as if the tensor has been flattened to its 
n 2 x n 2 matrix form. Then: 

\\Y-D\\ P = HQs o (A T MA) - Q s ° (A T MA)\\ F 

= HQs o (A 7 MA) - Qs o (A T MA) + Q s o (A T MA) - Q s o (A T MA)\\ F 

< HQs - Qs|| 2 ||(^ T M^)|| F + ||Q S ||2||A T (A/ - M)A\\ F 

< ^aWAWlWM - M + M\\ F + K max \\A\\lVne 2 

< n 2 ||A||2 £l (||M - M\\ F + \\M\\ F ) + V^/wlHHes 
<n 2 || J 4||2 ei (^ e2 + ||Af|| i ,) + Il^-Il2 e 2 

= n 2 ||A||l||M|| F ei + V^e 2 ||A||l(n 2 ei + K max ). 

This is also a bound for \\Y — £>|| 2 based on the standard inequality \\Y — D\\ 2 < \\Y — D\\ F . □ 

Lemma 16. II above bounds the error growth from tensor operations while placing all error on the 
diagonal matrix. The next goal is to demonstrate that taking the inverse of a matrix has reasonable 
error propagation properties. The following Lemma (a portion of Theorem 2.5 from |18j ) will be 
useful: 

Lemma 6.2. Let \\-\\ be any consistent matrix norm. Given a matrix C and a matrix perturbation 
E such that \\C~ 1 E\\ < 1, and given C = C + E, then 

lie- 1 !! -l-lic-^n 

From this Lemma, it follows immediately that if ||-E||2 < 1 / (2 1 1 C — 1 1 1 2 ) , then 

||c- 1 -c- 1 || 2 <2||c- 1 || 2 ||i : ;||2 (6) 

The main result of this paper is contained in Theorem 13. 2| which we prove now. 
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Proof of Theorem \3.SX This proof is long, and is split into 3 parts. In the first part, the preceding 
Lemmas are used to propagate error from the estimated latent tensor Qs- Then, a bound on the 
number of samples required to bound within e the cosine and scaling errors for the basis for the 
independent subspace from equations |T]) and is stated. Finally, it is demonstrated that the 
bound on angular error is correct. 

Let be a sample size to be chosen later as a function of an arbitrary parameter 77 > 0, so 
that with probability 1 — S we have ||Qs — <3s||max < V- Then, let D' — diag(K4(Si)|| J 4i || 2 , • • • , 
K4(S„)||A„|| 2 ) be the same as in the proof of Theorem [321 By Lemma R~2"l AD'A T = Q x o I, 
Let Y' be the estimate of D' generated as AY'A T = Qx I- Then by Lemma it follows that 
\\Y' — D'\\2 < n 5 / 2 \\ AW^rj. In order to apply equation ©, it is useful to get error bounds for 1 1 Z)' - 1 1 1 2 ■ 
It can be shown that: 

- a l (A) 2 * \\#-% < - (A)2 - (7) 

Then, it follows that using equation ([6]): 



< 



2W 1 - ^ \\2 

2n 5 / 2 ||A|| 2 77 



K min <T min(^) 4 
^2 



V 



2n 5 / 2 K (Ay 

K min <T min(^l) 2 



(8) 



with the restriction that 77 must be chosen such that \\Y' - D'\\ = n 5/2 \\A\\lr/ < 1/(2\\D'^ 1 \\ 2 ). This 
can be ensured by requiring that 77 < K m - ln /( 

Now, let Y and D be defined such that ADA T = Q x o(AD' A 1 *)- 1 and AY A T = Qy^o(AY' A T )~ 1 . 
By Lemma T6.1[ 

WY-DhKn'WAWlWiAD'A^- 1 ^ 

+ ^\\{AY l A T y 1 - (A^ T )- 1 || 2 ||A|| 2 (n 2 7 ? + « max ) 
< « 2 /*(A) 2 ||Z? / - 1 || jF 77 
+ y/nK(A) 2 \\Y'~ l - D'^hin 2 ^ + /c max ) 

rv>/ 2 K{A) 2 2n 3 K (A) 4 K max 2ti 5 k(A) 4 2 

( A\ \ 2 ^ ^ 2 / A \ 2 ^ ^ 2 ( A \ 2 ^ 

^A) K min a ^in{A) ^min^min [A) 



which follows by applying Q and ([8]). 
Since r\ < K min / '(2n 5 / 2 n(A) 2 ), 

\\Y — D\\ 2 — O ( n3K (^) 4Kmax 

\ cr min(^4) 2 K 2 nin 

Once again, it will be necessary to bound ||U||2 in order to apply equation ©. Using D = 
diag(l/||Ai|||, • • • , 1/||A„|| 2 ) from the proof of Theorem l4~3l it follows that: 

^min(^) 2 < \\D~% < <J m ^(A) 2 . 

Applying equation ([6]) yields: 

\\y- 1 -d-%<2\\d- 1 \\1\\y-d\\ 2 

7l 3 ht(A) 4 Kmax 
&min(A) 2 K mm 



lY^-D^h ^ _ /n 3 /c(A) 8 K max 



>(A) 2 " V «; 



2 

min 
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with the restriction that \\Y - D\\ 2 < 1/(2\\D- 1 \\ 2 ). Noting that ||Y-D|| 2 = O ( f ??) and 

1/||-D _1 ||2 > l/cr max (A) 2 , it suffices to restrict < O ( n a B ^„ )' 

M-y — 1 j-^ — 1 II 

Since r\ is arbitrary (except for upper bound restrictions), r\ can be chosen such that 11 - (A) 2 < - 
|. This can be accomplished taking r/ — O (^ n 4 K {jffi K — This choice is valid, as both restrictions 
on rj are met when e < 1. By Lemma 15.31 taking 



\ e 0K min / 



samples suffice to obtain the desired error bound e with probability 1 — 5. 

The basis in which S has independent coordinates is the canonical basis. Therefore, the ultimate 
goal is to show that, with our choice of an approximate quasi- whitening matrix B~ x below, the 
canonical vectors stay approximately orthogonal after applying B~ x A. To see this, factorize BB T = 
Ox ° (Qx ° I) 1 - i> is the approximate quasi- whitening matrix, and BB T = AY A T gives that 
B~ 1 AY 1 / 2 = R for some orthogonal matrix i?, and B~ 1 A = RY^ 1 / 2 . Since Y is symmetric, Y -1 / 2 
can be taken to be a symmetric matrix. Take to be canonical vectors. Define 5ij to be the 

delta function such that 

1 if i = j, 
otherwise. 



5jj 



Then with confidence 1 — 5, 

(B~ 1 Ae l , B~ 1 Ae j ) _ e[ A T B~ T B^ 1 Aej 



efy^e, 



DZ 1 f nr. 1 



\\A\\ 2 \\AA\ 2 2' ||Ai|| 2 ||^|| 2 
' S^WAWWAjW e Ml^llll^ 



I^IHI^II 2' II^IHIA, 




Consider the case where i = j. Then, 



\\B- 1 Ae i \\le(i±^) WMg 

which gives equation ^fy Consider the case where i ^ j. Then, 

(r'^.r'Ae,) WB^Ae^WB^Ae^h ± e 
||^||2pil|2 ' Ua-Meillall^-^Ha 6 2 

(B- 1 Ae i ,B- 1 Ae j ) ^ 



WB-iAeihWB-iAejh 2 1 ± e/2 
C ±e 

by restricting e < |. This gives equation (fl]), completing the proof. □ 
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